Abstract

Streaming interactive proofs (SIPs) are a framework to reason about outsourced computation, where a data owner 
(the verifier) outsources a computation to the cloud (the prover), but wishes to verify the correctness of the solution 
provided by the cloud service. In this paper we present streaming interactive proofs for problems in data analysis. We 
present protocols for clustering and shape fitting problems, as well as an improved protocol for rectangular matrix 
multiplication. The latter can in turn be used to verify k eigenvectors of a (streamed) $n \times n$ matrix.

In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication 
and verifier space. For special cases (when optimality certificates can be verified easily), we present constant round 
protocols with similar costs. For rectangular matrix multiplication and eigenvector verification, our protocols work in 
the more restricted annotated data streaming model, and use sublinear (but not polylogarithmic) communication. 


1 Introduction 

There are now many third party “cloud” services (from companies like Amazon, Google and Microsoft) that can 
perform intensive computational tasks on large data. Computing effort is split between a computationally weak “client” 
who owns the data and wishes to solve a desired task, and a “server” consisting of a cluster of computing nodes that 
performs the computation. 

In this setting, how does a client verify that a computation has been performed correctly? The client here will 
have limited (streaming) access to the data, as well as limited ability to talk to the server (measured by the amount 
of communication and rounds). Recently, there has been renewed interest in studying interactive verification with 
extremely limited sublinear space (or streaming) verifiers. Such streaming interactive proofs (SIPs) have been developed 
for classic problems in streaming, like frequency moment estimation and related graph problems. 

1.1 Our Contributions 

We initiate a study of streaming interactive proofs for problems in data analysis. In what follows, we will refer to both
SIPs and annotated streaming protocols, which are a variant of SIPs (we discuss the models and their differences in
Section 2).

Matrix Analysis. We present an annotated data streaming protocol (Section 3) for rectangular matrix multiplication
over any field $\mathbb{F}$. Specifically, given input matrices $A \in \mathbb{F}^{k \times n}$ and $B \in \mathbb{F}^{n \times k'}$, our protocol computes their product,
using communication cost $k \cdot k' \cdot h \log |\mathbb{F}|$ and space cost $v \log |\mathbb{F}|$, for any desired pair of positive integers h, v satisfying
$h \cdot v \ge n$. This improves on prior work by a factor of k in the space cost, and we prove that this tradeoff is optimal up
to a factor of $O(\min\{k, k'\})$. The rectangular matrix multiplication protocol can in turn be used to verify k (approximate)
eigenvectors of an $n \times n$ integer matrix A.

Shape Analysis. We present a number of protocols for shape fitting and clustering problems. (i) We give 3-message
SIPs that can verify a minimum enclosing ball (MEB) and the width of a point set exactly with polylogarithmic space
and communication costs. Note that the MEB cannot be approximated to better than a constant factor by a streaming
algorithm with space even polynomial in the dimension [1]. We show that the streaming hardness of the MEB problem
holds even when the points are chosen from a discrete cube; this is important because our interactive proofs require
discrete input (Section 4). (ii) We present polylogarithmic round protocols with polylogarithmic communication and
verifier space for verifying optimal k-centers and k-slabs in Euclidean space (note that computing the MEB and width
of a point set correspond to the 1-center and 1-slab problems respectively). (iii) We also show a simple
3-message protocol for verifying a 2-approximation to the k-center in a metric space, via a simple adaptation of the
Gonzalez 2-approximation for k-center.

^School of Computing, University of Utah
^Yahoo Labs
^School of Computing, University of Utah

^We cannot in general verify that the provided vectors are exact eigenvectors due to precision issues. Section 3 has details.

Technical Overview. In our annotated data streaming protocol for matrix multiplication, we first observe that
multiplying a $k \times n$ matrix A with an $n \times k'$ matrix B is equivalent to performing k' matrix-vector multiplications, one
for each column of B. But rather than naively implement k' matrix-vector verification protocols, we exploit the fact
that the k' matrix-vector multiplications are not independent, because the matrix A is held fixed in all of them. This
leads to an improved subroutine for rectangular matrix multiplication that in turn allows us to verify eigenvectors of a
matrix.

For the k-center and k-slab problems, we must verify feasibility and optimality of a claimed solution. We verify
feasibility by reducing to an instance of a Range Counting problem, for which a 2-message SIP exists. For optimality,
the prover must convince the verifier that no other feasible solution has lower cost. When k= 1, we show that there is a 
sparse witness of optimality, which the verifier can check directly using 3 messages, by reduction to Range Counting. 
For general k, we cannot produce such a witness. However, we observe that the “for-all” constraint on feasible solutions 
(that they all be costlier than the claimed solution) can be expressed as a sum over all solutions of potentially lower 
cost. Choosing a cost-based ordering of solutions converts this into a partial sum over a prefix of the ordered set of 
solutions. Our main tool is a way to verify such a sum in general, using polylogarithmically many messages, even when 
the relevant prefix is only known after the stream has passed. 

We note that while core sets are a natural witness for a property of a point set, they cannot always be computed by a 
streaming algorithm, nor is it clear that a claim of being a core set is easily verified. For the problems considered here, 
these issues preclude the use of a “simple” core set, requiring a more complex interactive protocol. 


1.2 Prior Work on Streaming Verification 

Chakrabarti et al. introduced the notion of annotations in data streams, whereby an all-powerful prover could
provide annotations to a verifier in order to complete a stream computation. Cormode et al. introduced the model of
Streaming Interactive Proofs (SIPs), which extends the annotated data streaming model to allow for multiple rounds of
interaction between the prover and verifier. They introduced a streaming variant of the classical sum-check protocol [22],
and used it to give logarithmic cost protocols for a variety of well-studied streaming problems. In subsequent works,
protocols were developed in both models for graph problems and matrix-vector operations and for sparse streams, and
were implemented [10]. Most recently, Chakrabarti et al. developed streaming interactive proofs of logarithmic
cost that worked in $O(1)$ rounds, making use of an interactive protocol for the Index problem. Lower bounds on
the cost of SIPs and their variants have also been studied [3, 4, 7, 18, 20]. These results make use of Arthur-Merlin
communication complexity and related notions. There has also been work in the cryptography community on stream
verification protocols that are secure only against cheating provers that run in polynomial time. The
interested reader is referred to [26] for a more detailed overview of the literature on models for stream verification.


2 Preliminaries 

Models. We will work in the streaming interactive proof (SIP) model first proposed by Cormode et al. In this
model, there are two players, the prover P and verifier V. The input consists of a stream $\tau$ of n items from some universe.
Let $f$ be a function mapping a stream $\tau$ to any finite set $\mathcal{S}$. A k-message SIP for $f$ works as follows. First, V and P
read the input stream. During this phase, V computes some small secret state, which depends on $\tau$ and V’s private
randomness. Second, V and P then exchange k messages, after which V outputs a value in $\mathcal{S} \cup \{\perp\}$, where $\perp$ indicates
that V is not convinced by P.

Any SIP for $f$ must satisfy soundness and completeness. Completeness requires that there exists some prover
strategy that causes the verifier to output $f(\tau)$ with probability $1 - \delta_c$ for some $\delta_c \le 1/3$. Soundness requires that for all
prover strategies, the verifier outputs a value in $\{f(\tau), \perp\}$ with probability $1 - \delta_s$ for some $\delta_s \le 1/3$. The values $\delta_c$ and
$\delta_s$ are referred to as the completeness and soundness errors.

Annotated Data Streams. The annotated data streaming model of Chakrabarti et al. essentially corresponds to
one-message SIPs.

Costs. In a SIP, the goal is to ensure that V uses sublinear space and that the protocol uses sublinear communication 
(number of bits exchanged between V and P) after stream observation. We will also desire protocols in which V and P 
can run quickly. In our protocols, both V and P can execute the protocol in time quasilinear in the size of the input 
stream. 

Input Model. All of the protocols we consider can handle inputs specified in a general data stream form. Each
element of the stream is a tuple $(i, \delta)$, where each $i$ lies in a data universe of size u, and $\delta \in \{+1, -1\}$. Negative
values of $\delta$ model deletions. The data stream implicitly defines a frequency vector $a = (a_1, \dots, a_u)$, where $a_i$ is the sum
of all $\delta$ values associated with $i$ in the stream.
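The input model above can be illustrated with a minimal sketch. The function name `frequency_vector` and the 1-based item indexing are assumptions made for this illustration; a streaming verifier would not store the vector explicitly, so the code only shows what the stream defines.

```python
def frequency_vector(stream, u):
    """Apply updates (i, delta) with delta in {+1, -1} over a universe {1, ..., u}.

    Returns the frequency vector a, where a[i - 1] is the sum of all deltas
    seen for item i. Storing a explicitly takes linear space; this is only
    meant to make the definition concrete.
    """
    a = [0] * u
    for i, delta in stream:
        if not 1 <= i <= u or delta not in (+1, -1):
            raise ValueError("update outside the assumed input model")
        a[i - 1] += delta
    return a


# Item 1 is inserted and later deleted; item 3 is inserted twice.
stream = [(3, +1), (1, +1), (3, +1), (1, -1), (2, +1)]
print(frequency_vector(stream, u=4))  # [0, 1, 2, 0]
```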

Discretization. The protocols we employ make extensive use of finite field arithmetic. In order to apply these
techniques to geometric problems, we must assume that all input points are drawn from the discretized grid $[m]^d$
as the data universe. Importantly, the costs of our protocols will depend only logarithmically on m, enabling the grid to
be exceedingly fine while still yielding tractable costs.

2.1 Protocols from Prior Work 

We will make use of three basic tools in our algorithms: Reed-Solomon fingerprints for testing vector equality, a
two-message SIP of Chakrabarti et al. for the PointQuery problem, and the streaming sum-check protocol of
Cormode et al. [12]. We summarize the main properties of these protocols here; for more details, the reader is referred
to the original papers.

2.1.1 Fingerprinting 

Theorem 2.1 (Reed-Solomon Fingerprinting). Suppose the input stream $\tau$ specifies two vectors $a, a' \in \mathbb{Z}^u$, guaranteed
to satisfy $|a_i|, |a'_i| \le u$ at the end of $\tau$. There is a streaming algorithm using $O(\log u)$ space that satisfies the following
properties: (i) If $a = a'$, then the algorithm outputs 1 with probability 1. (ii) If $a \ne a'$, then the algorithm outputs 0 with
probability at least $1 - 1/\mathrm{poly}(u)$.

Proof. Let $\mathbb{F}$ be a finite field of prime order whose size is a sufficiently large polynomial in u. We view each entry of $a$ and $a'$ as an element
of $\mathbb{F}$ in the natural way. At the start of the stream, the streaming algorithm picks an $\alpha \in \mathbb{F}$ at random, and computes
$\mathrm{finger}(a) = \sum_{i \in [u]} a_i \cdot \alpha^i$ and $\mathrm{finger}(a') = \sum_{i \in [u]} a'_i \cdot \alpha^i$ with a single streaming pass over $\tau$. The algorithm outputs 1 if
and only if $\mathrm{finger}(a) = \mathrm{finger}(a')$. Property (i) clearly holds: if $a = a'$, then the algorithm outputs 1 with probability 1.
To see that Property (ii) holds, observe that $\mathrm{finger}(a)$ and $\mathrm{finger}(a')$ are univariate polynomials in $\alpha$ of degree at most u. If
$a \ne a'$, these two polynomials are not equal. Property (ii) then follows, because any two distinct polynomials of degree
at most u over $\mathbb{F}$ can agree on at most u inputs, yielding an error of at most $u/|\mathbb{F}| = 1/\mathrm{poly}(u)$. □
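The fingerprinting argument can be sketched in a few lines. This is only an illustration under stated assumptions: the prime `P = 2**61 - 1` stands in for the field, and the names `Fingerprint` and `fingerprints_match` are hypothetical. The optional `alpha` argument exists only so the example can be run deterministically; the algorithm itself draws it at random.

```python
import random

P = 2**61 - 1  # a Mersenne prime standing in for the field F (assumption)


class Fingerprint:
    """Maintains sum_i a_i * alpha^i mod P under updates (i, delta)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = 0

    def update(self, i, delta):
        # Changing a_i by delta changes the fingerprint by delta * alpha^i.
        self.value = (self.value + delta * pow(self.alpha, i, P)) % P


def fingerprints_match(stream_a, stream_b, alpha=None):
    """Return True iff the two update streams have equal fingerprints."""
    if alpha is None:
        alpha = random.randrange(P)  # the random evaluation point
    fa, fb = Fingerprint(alpha), Fingerprint(alpha)
    for i, delta in stream_a:
        fa.update(i, delta)
    for i, delta in stream_b:
        fb.update(i, delta)
    return fa.value == fb.value


# Same frequency vector, different update order and a cancelled insertion.
same_1 = [(1, +1), (3, +1), (3, +1)]
same_2 = [(3, +1), (1, +1), (3, +1), (2, +1), (2, -1)]
print(fingerprints_match(same_1, same_2))  # True
```

Equal vectors always match, which is Property (i). Unequal vectors can collide only when `alpha` is a root of the difference polynomial.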

^All of our protocols achieve perfect completeness and soundness error $1/\mathrm{poly}(n)$.

^While the original model allowed P to interleave information with the stream, most known annotated streaming protocols do not do so, and are
thus 1-message SIPs.

Input: V is given oracle access to a $v$-variate polynomial $g$ over finite field $\mathbb{F}$ and an
$H \in \mathbb{F}$.

Goal: Determine whether $H = \sum_{(x_1, \dots, x_v) \in \{0,1\}^v} g(x_1, \dots, x_v)$.

• In the first round, P computes the univariate polynomial

$$g_1(X_1) := \sum_{(x_2, \dots, x_v) \in \{0,1\}^{v-1}} g(X_1, x_2, \dots, x_v),$$

and sends $g_1$ to V. V checks that $g_1$ is a univariate polynomial of degree at most
$\deg_1(g)$, and that $H = g_1(0) + g_1(1)$, rejecting if not.

• V chooses a random element $r_1 \in \mathbb{F}$, and sends $r_1$ to P.

• In the $j$th round, for $1 < j < v$, P sends to V the univariate polynomial

$$g_j(X_j) := \sum_{(x_{j+1}, \dots, x_v) \in \{0,1\}^{v-j}} g(r_1, \dots, r_{j-1}, X_j, x_{j+1}, \dots, x_v).$$

V checks that $g_j$ is a univariate polynomial of degree at most $\deg_j(g)$, and that
$g_{j-1}(r_{j-1}) = g_j(0) + g_j(1)$, rejecting if not.

• V chooses a random element $r_j \in \mathbb{F}$, and sends $r_j$ to P.

• In round $v$, P sends to V the univariate polynomial

$$g_v(X_v) := g(r_1, \dots, r_{v-1}, X_v).$$

V checks that $g_v$ is a univariate polynomial of degree at most $\deg_v(g)$, and that
$g_{v-1}(r_{v-1}) = g_v(0) + g_v(1)$, rejecting if not.

• V chooses a random element $r_v \in \mathbb{F}$ and evaluates $g(r_1, \dots, r_v)$ with a single
oracle query to $g$. V checks that $g_v(r_v) = g(r_1, \dots, r_v)$, rejecting if not.

• If V has not yet rejected, V halts and accepts.


Figure 1: Description of the sum-check protocol. $\deg_i(g)$ denotes the degree of $g$ in the $i$th variable.
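The rounds of Figure 1 can be simulated with an honest prover, as in the sketch below. Several choices here are assumptions for illustration only: the field is the prime field of order `2**61 - 1`, each prover message is sent as `deg + 1` evaluations at the points 0, ..., deg (which enforces the degree bound by construction), a single bound `deg >= 1` is used for every variable, and the names `interpolate`, `prover_message` and `sum_check` are hypothetical.

```python
import itertools
import random

P = 2**61 - 1  # prime field standing in for F (assumption)


def interpolate(evals, x):
    """Evaluate at x the polynomial of degree < len(evals) with value evals[t] at t."""
    total = 0
    for t, y in enumerate(evals):
        num, den = 1, 1
        for s in range(len(evals)):
            if s != t:
                num = num * (x - s) % P
                den = den * (t - s) % P
        total = (total + y * num * pow(den, P - 2, P)) % P
    return total


def prover_message(g, v, deg, prefix):
    """Honest prover: evaluations of g_j at 0..deg, where j = len(prefix) + 1."""
    rest = v - len(prefix) - 1
    evals = []
    for t in range(deg + 1):
        s = 0
        for tail in itertools.product((0, 1), repeat=rest):
            s += g(*prefix, t, *tail)
        evals.append(s % P)
    return evals


def sum_check(g, v, deg, claimed, prover=prover_message, rng=random):
    """Return True iff the verifier accepts the claim sum_{x in {0,1}^v} g(x) = claimed."""
    prefix, current = [], claimed % P
    for _ in range(v):
        evals = prover(g, v, deg, prefix)
        # Check g_j(0) + g_j(1) against the previous round's value.
        if len(evals) != deg + 1 or (evals[0] + evals[1]) % P != current:
            return False
        r = rng.randrange(P)
        current = interpolate(evals, r)  # g_j(r_j)
        prefix.append(r)
    # Final check: one evaluation of g at the random point.
    return g(*prefix) % P == current


def g(x1, x2, x3):
    return (x1 * x1 * x2 + 3 * x3) % P


print(sum_check(g, v=3, deg=2, claimed=14))  # True: the sum over {0,1}^3 is 14
print(sum_check(g, v=3, deg=2, claimed=15))  # False: rejected in round 1
```

In a streaming setting the final call `g(*prefix)` would be replaced by a value the verifier computed during its pass over the stream; the oracle call here mirrors the figure.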


2.1.2 The PointQuery and RangeCount Protocols 

An instance of the PointQuery problem consists of a stream of updates as described above followed by a query $q \in [u]$.
The goal is to compute the coordinate $a_q$. For the RangeCount problem, let $(\mathcal{U}, \mathcal{R})$ be a range space and let the input consist
of a stream $\tau$ of elements (with size n) from the data universe $\mathcal{U}$ (with size u), followed by a range $R \in \mathcal{R}$. The goal is
to verify a claim by P that $|R \cap \tau| = k$.

Theorem 2.2 (Chakrabarti et al.). Suppose the input to PointQuery satisfies $|a_i| \le \Delta$ at the end of the stream,
for some known $\Delta$. Then there is a two-message SIP for PointQuery on an input stream with length n, with space
and communication each bounded by $O(\log u \cdot \log(\Delta + \log u))$. For RangeCount, there is a two-message SIP
with space and communication cost bounded by $O(\log(|\mathcal{R}|) \cdot \log(n \cdot |\mathcal{R}|))$. In particular, for range spaces
of bounded shatter dimension $\rho$, $\log |\mathcal{R}| = \rho \log n = O(\log n)$.

2.1.3 Sum-Check Protocol 

Properties and Costs of the Sum-check Protocol. The sum-check protocol satisfies perfect completeness, and has
soundness error $\varepsilon \le \deg(g)/|\mathbb{F}|$, where $\deg(g)$ denotes the total degree of $g$ (see [22] for a proof). There is one round
of prover-verifier interaction in the sum-check protocol for each of the $v$ variables of $g$, and the total communication is
$O(\deg(g))$ field elements.

Note that as described in Figure 1, the sum-check protocol assumes that the verifier has oracle access to $g$. However,
this will not be the case in applications, as $g$ will ultimately be a polynomial that depends on the input data stream. In
order to apply the sum-check protocol in a streaming setting, it is necessary to assume that V can evaluate $g$ at any point
$r$ in small space with a single streaming pass over the input (this assumption is made in Theorem 2.3). Alternatively,
one can have the prover tell the verifier $g(r)$, and then prove to the verifier that the value $g(r)$ is as claimed, using
further applications of the sum-check protocol, or heavier hammers such as the GKR protocol (as described below),
which is itself based on the sum-check protocol.


Theorem 2.3 (Streaming Sum Check Protocol [12]). Let $g$ be a $v$-variate polynomial over $\mathbb{F}$, which may depend on the
input stream $\tau$. Denote the degree of $g$ in variable $i$ by $\deg_i(g)$. Assume V can evaluate $g$ at any point $r \in \mathbb{F}^v$ with a
streaming pass over $\tau$, using $O(v \cdot \log |\mathbb{F}|)$ bits of space. There is an SIP for computing the function $F(\tau) = \sum_{\sigma \in \{0,1\}^v} g(\sigma)$
that uses $O(v)$ messages and $O(\sum_{i=1}^{v} \deg_i(g) \cdot \log |\mathbb{F}|)$ communication, as well as $O(v \cdot \log |\mathbb{F}|)$ space.


For completeness, we present a description of the sum-check protocol of Lund et al. [22] in Figure 1.


2.1.4 The GKR Protocol 

Interactive proofs can be designed by algebrizing a circuit computing a function. One of the most powerful protocols of
this form is due to Goldwasser et al. [16] and known as the GKR protocol. This was adapted to the streaming setting
by Cormode et al. [12], yielding the following result.


Lemma 2.4 ([12, 16]). Let $\mathbb{F}$ be a finite field, and let $f : \mathbb{F}^u \to \mathbb{F}$ be a function of the entries of the frequency vector of
a data stream (viewing the entries as elements of $\mathbb{F}$). Suppose that $f$ can be computed by an $O(\log(S) \cdot \log(|\mathbb{F}|))$-space
uniform arithmetic circuit $\mathcal{C}$ (over $\mathbb{F}$) of fan-in 2, size S, and depth d, with the inputs of $\mathcal{C}$ being the entries of the
frequency vector. Then, assuming that $|\mathbb{F}| = \Omega(d \cdot \log S)$, $f$ possesses an SIP requiring $O(d \cdot \log S)$ rounds. The total
space cost is $O(\log u \cdot \log |\mathbb{F}|)$ and the total communication cost is $O(d \cdot \log(S) \cdot \log |\mathbb{F}|)$.


3 Rectangular Matrix Multiplication and Eigenstructure 

Many algorithms in data analysis require computation of the eigenpairs (eigenvalues and eigenvectors) of a large data
matrix. Eigenvalues of a streamed $n \times n$ matrix can be computed approximately without a prover, but there are no
streaming algorithms to compute the eigenvectors of a matrix because of the output size.

Verifying the eigenstructure of a symmetric matrix A is more difficult than merely verifying that a claimed $(\lambda, v)$ is
an eigenpair. This is because the prover must convince the verifier not only that each $(\lambda_i, v_i)$ satisfies $A v_i = \lambda_i v_i$, but
also that the collection of eigenvectors together are orthogonal. Thus, the prover must prove that $V V^T = D$, where V is
the collection of eigenvectors and D is some diagonal matrix. Note however that this matrix multiplication check is
rectangular: if we wish to verify that a collection of k eigenvectors are orthogonal, we must multiply a $k \times n$ matrix V
by an $n \times k$ matrix $V^T$.

We present an annotation protocol called MatrixMultiplication to verify such a rectangular matrix multiplication.
Our protocol builds on the optimal annotation protocols for inner product and matrix-vector multiplication from
prior work. We prove that our MatrixMultiplication protocol obtains tradeoffs between communication and space usage
that are optimal up to a factor of $O(\min\{k, k'\})$.

Theorem 3.1. Let A be a $k \times n$ matrix and B an $n \times k'$ matrix, both with entries in a finite field $\mathbb{F}$ of size $\mathrm{poly}(n)$.
Let $(h, v)$ be any pair of positive integers such that $h \cdot v \ge n$. There is an annotated data streaming protocol for computing
the product matrix $C = A \cdot B$ with communication cost $O(k \cdot k' \cdot h \cdot \log n)$ bits and space cost $O(v \cdot \log n)$ bits. Moreover,
any (online) annotated data streaming protocol for the problem requires the product of the space and communication
costs to be at least $\Omega((k + k') \cdot n)$.

Proof. To present the upper bound, we first recall the inner product protocol of Chakrabarti et al. Given input
vectors $a, b \in \mathbb{F}^n$, the verifier in this protocol treats the n entries of $a$ and $b$ as a grid $[h] \times [v]$, and considers the unique
bivariate polynomials $\hat{a}(X, Y)$ and $\hat{b}(X, Y)$ over $\mathbb{F}$ of degree at most h in X and v in Y satisfying $\hat{a}(x, y) = a(x, y)$ and
$\hat{b}(x, y) = b(x, y)$ for all $(x, y) \in [h] \times [v]$. The verifier picks a random $r \in \mathbb{F}$, and evaluates $\hat{a}(r, y)$ and $\hat{b}(r, y)$ for all $y \in [v]$.
As observed in prior work, the verifier can compute $\hat{a}(r, y)$ for any $y \in [v]$ in space $O(\log |\mathbb{F}|)$, with a single streaming pass over
the input. Hence, the verifier’s total space usage is $O(v \cdot \log |\mathbb{F}|)$. The prover then sends a univariate polynomial $s(X)$ of
degree at most h, claimed to equal $g(X) = \sum_{y \in [v]} \hat{a}(X, y) \cdot \hat{b}(X, y)$. The verifier accepts $\sum_{x \in [h]} s(x)$ as the correct answer
if and only if $s(r) = \sum_{y \in [v]} \hat{a}(r, y) \cdot \hat{b}(r, y)$.

Returning to matrix multiplication, let us denote the rows of A by $a_1, \dots, a_k$ and the columns of B by $b_1, \dots, b_{k'}$.
Notice that each entry $C_{ij}$ of C is the inner product of $a_i$ and $b_j$.
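The inner product protocol recalled above can be made concrete with a small simulation. Everything specific below is an assumption for illustration: the field is the prime field of order `2**61 - 1`, the grid is indexed from 0, entry `idx` of a vector sits at cell `(idx // v, idx % v)`, and the prover's polynomial is sent as `2h - 1` evaluations because the product of two degree-(h - 1) interpolants has degree at most 2(h - 1) in this sketch. The function names are hypothetical.

```python
import random

P = 2**61 - 1  # prime field standing in for F (assumption)


def lagrange_basis(x, r, h):
    """chi_x(r): the polynomial that is 1 at x and 0 elsewhere on {0, ..., h-1}."""
    num, den = 1, 1
    for s in range(h):
        if s != x:
            num = num * (r - s) % P
            den = den * (x - s) % P
    return num * pow(den, P - 2, P) % P


def verifier_state(vec, h, v, r):
    """One pass over vec, keeping only the v values a^(r, y)."""
    state = [0] * v
    for idx, val in enumerate(vec):
        x, y = divmod(idx, v)  # grid cell of this entry
        state[y] = (state[y] + val * lagrange_basis(x, r, h)) % P
    return state


def prover_polynomial(a, b, h, v):
    """Honest prover: evaluations of s(X) = sum_y a^(X, y) * b^(X, y) at X = 0..2h-2."""
    evals = []
    for X in range(2 * h - 1):
        ax, bx = verifier_state(a, h, v, X), verifier_state(b, h, v, X)
        evals.append(sum(p * q for p, q in zip(ax, bx)) % P)
    return evals


def evaluate(evals, r):
    """Evaluate the polynomial given by its values at 0..len(evals)-1."""
    n = len(evals)
    return sum(y * lagrange_basis(t, r, n) for t, y in enumerate(evals)) % P


def verify_inner_product(a, b, h, v, evals, rng=random):
    """Return the accepted inner product (mod P), or None if the verifier rejects."""
    r = rng.randrange(P)
    ar, br = verifier_state(a, h, v, r), verifier_state(b, h, v, r)
    expected = sum(p * q for p, q in zip(ar, br)) % P
    if len(evals) != 2 * h - 1 or evaluate(evals, r) != expected:
        return None
    return sum(evals[:h]) % P  # sum of s(x) over the h grid rows


a, b = [1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]
evals = prover_polynomial(a, b, h=2, v=3)
print(verify_inner_product(a, b, 2, 3, evals))  # 56
```

The verifier keeps `v` field elements per vector and the prover sends `O(h)` field elements, which is the tradeoff governed by the condition relating h, v and n in the theorem.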

The prover’s computation. In our matrix multiplication protocol, the prover simply runs the above inner product
protocol $k \cdot k'$ times, once for each entry $C_{ij}$ of C. This requires sending $k \cdot k'$ polynomials, $s_{ij}(X) : (i, j) \in [k] \times [k']$, each
of degree at most h. Hence, the total communication cost is $O(k \cdot k' \cdot h \cdot \log n)$.

The verifier’s computation while observing entries of A. The verifier picks a random $\alpha \in \mathbb{F}$ and computes, for each
$y \in [v]$, the quantity $s_y := \sum_{i \in [k]} \hat{a}_i(r, y)\, \alpha^{i}$. Using standard techniques, the verifier can compute each $s_y$ with a single
streaming pass over the entries of A, in $O(\log n)$ space. Hence, the verifier can compute all of the $s_y$ values in total
space $O(v \cdot \log n)$.

The verifier’s computation while observing entries of B. For each $y \in [v]$, the verifier computes the quantity
$s'_y := \sum_{j \in [k']} \hat{b}_j(r, y)\, \alpha^{k j}$. The reason that we define $s'_y$ in this way is because it ensures that

$$s_y \cdot s'_y = \sum_{(i,j) \in [k] \times [k']} \hat{a}_i(r, y) \cdot \hat{b}_j(r, y)\, \alpha^{i + k j},$$

which is just a fingerprint of the set of values $\{\hat{a}_i(r, y) \cdot \hat{b}_j(r, y)\}$ as $(i, j)$ ranges over $[k] \times [k']$.

To check that all $s_{ij}$ polynomials are as claimed, the verifier does the following. As the verifier reads the $s_{ij}$
polynomials, she computes a fingerprint of the $s_{ij}(r)$ values, i.e., the verifier computes $\sum_{(i,j) \in [k] \times [k']} s_{ij}(r) \cdot \alpha^{i + k j}$. The verifier
checks whether this equals $\sum_{y \in [v]} s_y \cdot s'_y$. If so, the verifier is convinced that $C_{ij} = \sum_{x \in [h]} s_{ij}(x)$ for all $(i, j) \in [k] \times [k']$. If
not, the verifier rejects.

Proof of completeness. Let $g_{ij}(X) := \sum_{y \in [v]} \hat{a}_i(X, y) \cdot \hat{b}_j(X, y)$ denote the polynomial that an honest prover sends for entry $(i, j)$. If the $s_{ij}$ polynomials are as claimed, then:

$$\sum_{(i,j) \in [k] \times [k']} s_{ij}(r) \cdot \alpha^{i + k j} = \sum_{(i,j) \in [k] \times [k']} \sum_{y \in [v]} \hat{a}_i(r, y) \cdot \hat{b}_j(r, y)\, \alpha^{i + k j}$$

$$= \sum_{y \in [v]} \sum_{(i,j) \in [k] \times [k']} \hat{a}_i(r, y) \cdot \hat{b}_j(r, y)\, \alpha^{i + k j} = \sum_{y \in [v]} s_y \cdot s'_y.$$

Proof of soundness. If any of the $s_{ij}$ polynomials are not as claimed (i.e., if $s_{ij}(X) \ne g_{ij}(X)$ as formal polynomials),
then with probability at least $1 - h/|\mathbb{F}|$ over the random choice of $r \in \mathbb{F}$, it will hold that $s_{ij}(r) \ne g_{ij}(r)$. In this event
the verifier will wind up comparing the fingerprints of two different vectors, namely the $k \cdot k'$-dimensional vector whose
$(i, j)$’th entry is $s_{ij}(r)$, and the $k \cdot k'$-dimensional vector whose $(i, j)$’th entry is $\sum_{y \in [v]} \hat{a}_i(r, y) \cdot \hat{b}_j(r, y)$. These
fingerprints will disagree with probability at least $1 - k \cdot k'/|\mathbb{F}|$. Hence, the probability that the prover convinces the
verifier to accept is at most $h/|\mathbb{F}| + k \cdot k'/|\mathbb{F}|$. If $|\mathbb{F}| \ge 100 \cdot h \cdot k \cdot k'$, the soundness error will be bounded by 1/50.
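The batched check at the heart of the protocol is an algebraic identity that can be tested directly. In the sketch below, `u[i][y]` and `w[j][y]` are arbitrary field elements standing in for the evaluations of the row and column extensions at the random point, and `claimed[i][j]` stands in for the prover's value for entry (i, j). The exponent layout (1-based i plus k times 1-based j) is an assumption of this sketch: any assignment of distinct exponents to the pairs (i, j) that factors across i and j would serve the same purpose. The name `batched_check` and the prime `P` are likewise hypothetical.

```python
P = 2**61 - 1  # prime field standing in for F (assumption)


def batched_check(u, w, claimed, alpha):
    """Compare one fingerprint of all k*k' claimed values against sum_y s_y * s'_y.

    u: k lists of length v; w: k' lists of length v; claimed: k x k' matrix.
    """
    k, kp, v = len(u), len(w), len(u[0])
    # s_y and s'_y are each computable in one pass over A and B respectively.
    s = [sum(u[i][y] * pow(alpha, i + 1, P) for i in range(k)) % P
         for y in range(v)]
    sp = [sum(w[j][y] * pow(alpha, k * (j + 1), P) for j in range(kp)) % P
          for y in range(v)]
    rhs = sum(x * z for x, z in zip(s, sp)) % P
    # Fingerprint of the prover's k*k' values, using the same exponents.
    lhs = sum(claimed[i][j] * pow(alpha, (i + 1) + k * (j + 1), P)
              for i in range(k) for j in range(kp)) % P
    return lhs == rhs


u = [[1, 2], [3, 4]]             # k = 2, v = 2
w = [[5, 6], [7, 8], [9, 10]]    # k' = 3
honest = [[sum(p * q for p, q in zip(ui, wj)) for wj in w] for ui in u]
print(honest)                                  # [[17, 23, 29], [39, 53, 67]]
print(batched_check(u, w, honest, alpha=7))    # True

cheat = [row[:] for row in honest]
cheat[1][2] += 1
print(batched_check(u, w, cheat, alpha=7))     # False
```

The verifier stores only the `v` pairs of running sums rather than all `k * k'` per-entry states, which is where the space saving over running the inner product protocol independently comes from.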

Lower bound. Cormode et al. proved a lower bound on the cost of (online) annotated data streaming protocols for
matrix-vector multiplication (i.e., for multiplying a $k \times n$ matrix A by an $n \times 1$ matrix B). Specifically, their argument
implies that if A is $k \times n$, then any protocol for multiplying A by a vector must have the product of the space and
communication costs be at least $\Omega(k \cdot n)$. The claimed lower bound follows if $k \ge k'$ (the case of $k < k'$ is analogous).

□ 


On V’s and P’s runtimes. Using Fast Fourier Transform techniques (cf. [10, Section 2]), the prover in the protocol of
Theorem 3.1 can run in $O(k \cdot k' \cdot n \log n)$ total time, assuming the total number of updates to the input matrices A, B is
$O(k \cdot k' \cdot n \log n)$. The verifier can run in time $O(\log n)$ per stream update.


The Eigenpair Verification Protocol. We now show how to use Theorem 3.1 to verify that a claimed set of k
eigenvalues and eigenvectors are indeed (approximate) eigenpairs of a given symmetric integer input matrix A. The
protocol is cleanest to present assuming the entries of all of the claimed eigenvectors are integers, in which case the 
protocol can verify that the vectors are exact eigenvectors. We explain how to handle the general case at the end of the 
section. 


The case where all claimed eigenvectors have integer entries. The eigenpair verification protocol invokes
MatrixMultiplication twice. In the first invocation, MatrixMultiplication is used to simultaneously verify that all claimed
eigenpairs are indeed eigenpairs. Specifically, the MatrixMultiplication protocol is used to compute $C = A \cdot V$, where
V is the matrix whose $i$th column equals the $i$th claimed eigenvector $v_i$. The verifier uses fingerprints to check that
$C = V \cdot D$, where D is the diagonal matrix with entries corresponding to the claimed eigenvalues. In the second
invocation, MatrixMultiplication is used to check that the claimed eigenvectors are orthogonal, by verifying that $V^T V = D'$ for
some diagonal matrix $D'$ provided by the prover. Note that in both invocations of the MatrixMultiplication protocol, the
verifier does not have the space to explicitly store the matrix V. Fortunately, storing V is not necessary, as within both
invocations of the MatrixMultiplication protocol, V is treated as part of the input stream, and the MatrixMultiplication
protocol does not require the verifier to store the input.

The general case. We now sketch at a high level how to handle the general case, in which the entries of the claimed
eigenvectors are not integers (note that since A is symmetric, the entries of all of its eigenvectors can be taken to be
real). The protocol guarantees in this general case that, for any desired error parameter $\varepsilon$, each claimed eigenpair
$(\lambda_i, v_i)$ satisfies $\|A v_i - \lambda_i v_i\|_2 \le \varepsilon$. The approach we take to handle non-integer entries is exactly as in the eigenvalue-
verification protocol of Cormode et al. Specifically, we reduce to the integer case by requiring the prover to round
the entries of all claimed eigenvectors and eigenvalues to an integer multiple of $\varepsilon'$ for some sufficiently small value $\varepsilon'$,
in such a way that the resulting eigenvectors are exactly orthogonal. It is straightforward to show that there is some
$\varepsilon' = 1/\mathrm{poly}(n, \varepsilon^{-1})$ such that the rounding changes each entry of $A v_i$ by at most $\varepsilon/\mathrm{poly}(n)$. This ensures that the matrix
$V/\varepsilon'$ has integer entries, all bounded in absolute value by $\mathrm{poly}(n/\varepsilon)$. Hence, each entry of $V/\varepsilon'$ can be identified
with an element of a finite field of size $\mathrm{poly}(n, \varepsilon^{-1})$, and we can apply the integer matrix multiplication protocol to
compute $A \cdot (V/\varepsilon')$ and $(V/\varepsilon')^T (V/\varepsilon')$. The verifier checks that the latter result is a diagonal matrix, guaranteeing that
the claimed eigenvectors are orthogonal. Given the former result, it is straightforward for the prover to convince the
verifier that each entry of the former matrix is close enough to $(V/\varepsilon') \cdot D$ to ensure that $\|A v_i - \lambda_i v_i\|_2 \le \varepsilon$.

Theorem 3.2. Let A be a symmetric $n \times n$ integer matrix with entries bounded in absolute value by $\mathrm{poly}(n)$. Let k
be an integer, let h and v be positive integers satisfying $h \cdot v \ge n$, and let $\varepsilon > 0$ be an error parameter. Then there
is an annotated data streaming protocol for verifying that a collection of k eigenpairs $(\lambda_i, v_i)$ are orthogonal, and
each satisfies $\|A v_i - \lambda_i v_i\|_2 \le \varepsilon$. The total communication cost is $O(k^2 \cdot h \cdot \log(n/\varepsilon))$ and the verifier’s space cost is
$O(v \cdot \log(n/\varepsilon))$.

4 Shape Analysis in a Few Rounds 

In this section, we give 3-message SIPs of polylogarithmic cost for finding an MEB and computing the width of a 
point set. The key here is to identify a sparse dual witness that proves optimality (or near-optimality) of the claimed 
(primal) solution and then check feasibility of both primal and dual solutions. We show how the verifier can perform 
both feasibility checks via a careful reduction to an instance of the RangeCount problem. 

4.1 Verifying Minimum Enclosing Balls 

Consider the Euclidean k-center problem with k = 1, otherwise known as the MEB: given a set of n points $P \subset \mathbb{R}^d$,
find a ball $B^*$ of minimum radius that encloses all of them.

The MEB presents an interesting contrast between our model and the classical streaming model. It is known that no
streaming algorithm that uses $\mathrm{poly}(d)$ space can approximate the MEB of a set of points to better than a constant factor,
either by a coreset-based construction or in general. Also, the best streaming multiplicative $(1 + \varepsilon)$-approximation
for the MEB uses $O((1/\varepsilon)^{O(d)})$ space.

4.1.1 The Protocol 

The prover reads the input and sends the (claimed) minimum enclosing ball B. Our protocol reduces checking feasibility 
and optimality of B to carefully constructed instances of the RangeCount problem. 

Checking Feasibility. We consider a new range space, in which the range set $\mathcal{R}$ is defined to consist of all balls with
centers in $[m]^d$ and with radius $j$, where $j$ ranges over a bounded set of integers $\{0, 1, \dots\}$. Notice that $|\mathcal{R}| = m^{O(d)}$. Using the protocol for RangeCount
(Theorem 2.2), we can verify that the claimed solution B does in fact cover all points (because this will hold if and only
if the range count of B equals the cardinality of the input point set $|P| = n$).

Checking Optimality. We will make use of the following well known fact about minimal enclosing balls, which was
used as the main idea for developing an approximation algorithm for the furthest neighbour problem by Goel et al. [15]:

Lemma 4.1. Let $B^*$ be the minimal enclosing ball of a set of points P in $\mathbb{R}^d$. Then there exist at most d + 2 points of P
that lie on the boundary $\partial B^*$ of $B^*$ and contain the center of $B^*$ in their convex hull.


Putting it all Together. The complete 3-message MEB protocol works as follows.

1. V processes the data stream for RangeCount (with respect to the range set $\mathcal{R}$ of balls defined above and P).

2. P computes the MEB $B^*$ of P, then rounds the center c of the MEB to the nearest grid vertex. Denote this vertex
by $c^*$. P sends $c^*$ to V, as well as the radius r of $B^*$, and a subset of points $T \subseteq P$ for which MEB(T) = MEB(P).
(Note that based on Lemma 4.1, $|T| \le d + 2$ suffices.)

3. V first computes the center c of the MEB for the subset T and checks if $c^*$ is actually the rounded value of c.
Then V runs a RangeCount protocol with P to verify that the ball of radius r + 1 and center $c^*$ contains all of the
input points. It then runs multiple copies of PointQuery to verify that the $|T| \le d + 2$ points provided by P
are actually in the input set P.


Theorem 4.2. There exists a 3-message SIP for the Minimum Enclosing Ball (MEB) problem with communication and
space cost bounded by $O(d^2 \cdot \log^2 m)$.


On V’s and P’s runtimes. Assuming the distance function D under which the instance of MEB is defined satisfies
mild “efficient-computability” properties, both V and P can be made to run in total time $\mathrm{polylog}(m^d)$ per stream update
in the protocol of Theorem 4.2. Specifically, it is enough that for any point $x \in P$, there is a De Morgan formula of size
$\mathrm{polylog}(m^d)$ that takes as input the binary representation of a ball $B \in \mathcal{R}$ and outputs 1 if and only if $x \in B$. Under the
same assumption on D, the prover P can be made to run in time $T + n \cdot \mathrm{polylog}(m^d)$, where T is the time required to
find the MEB of the input point set P. For details, see the full description of the PointQuery protocol of Chakrabarti et al.


4.1.2 Streaming lower bounds on the grid 

We note that restricting the points to a grid does not make the MEB problem easier for a streaming algorithm. Here we
show that the lower bound for streaming MEB due to Agarwal and Sharathkumar can be modified to work even if the
points lie on a grid. The key lemma in Agarwal and Sharathkumar’s lower bound is a construction of a collection of
almost orthogonal vectors that are centrally symmetric. Let $\mathbb{S}^{d-1}$ denote the unit sphere in $\mathbb{R}^d$.

Lemma 4.3 (Agarwal and Sharathkumar). There is a centrally symmetric point set $K \subset \mathbb{S}^{d-1}$ of size $\Omega(\exp(d^{1/3}))$
such that for any pair of distinct points $p, q \in K$, if $p \ne -q$, then

$$\sqrt{2}\left(1 - \frac{2}{d^{1/3}}\right) \le \|p - q\| \le \sqrt{2}\left(1 + \frac{2}{d^{1/3}}\right). \qquad (1)$$

This point set is then used by an adversary to “defeat” any algorithm claiming a $(\sqrt{2} - \delta)$ approximation. Note that
the “almost orthogonal” property follows from the observation that for unit vectors $p, q$, $\|p - q\|^2 = 2 - 2\langle p, q \rangle$, and
therefore the condition of the lemma above implies that $|\langle p, q \rangle| = O(1/d^{1/3})$.

It turns out that this “almost-orthogonal” property can be achieved by vectors with integer coordinates. The proof is
in the same spirit as the proofs that sign matrices can be used in the Johnson-Lindenstrauss lemma, and follows from an
observation by Ryan O’Donnell. We recreate the proof here for completeness.

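Lemma 4.4 (Bernstein’s inequality). Let $X_1, \ldots, X_d$ be independent Bernoulli variables taking values in
$\{+1, -1\}$ with equal probability. Then

$$\Pr\left[\left|\frac{1}{d}\sum_{i=1}^{d} X_i\right| \ge \varepsilon\right] \le 2\exp\left(-\frac{d\varepsilon^2}{2(1 + \varepsilon/3)}\right).$$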
Lemma 4.5. Let $t = \exp(d\varepsilon^2/8)$. Let $u_1, \ldots, u_t$ be random vectors in $\mathbb{R}^d$ in which each entry is
set to $1/\sqrt{d}$ or $-1/\sqrt{d}$, with probability $1/2$ each. There is a positive probability of
$|\langle u_i, u_j\rangle| \le \varepsilon$ holding for all $i \ne j$.


Proof. We define variables $x_{ik}$ as the Bernoulli variables corresponding to Lemma 4.5, where $i \le t$ and $k \le d$.
That is, define the $x_{ik}$ variables such that

$$u_i = \frac{1}{\sqrt{d}}\,(x_{i1}, \ldots, x_{id}).$$

We want to analyze the behavior of $\langle u_i, u_j\rangle$. Set $Y^{ij}_k = x_{ik} x_{jk}$ and write
$\langle u_i, u_j\rangle$ as $\frac{1}{d}\sum_{k=1}^{d} Y^{ij}_k$. Note that for each $i \ne j$ and each $k$, $Y^{ij}_k$ is a
Bernoulli variable with range $\{-1,+1\}$, and for any fixed $i, j$, the variables $Y^{ij}_k$ are independent.
Therefore, we can apply Bernstein’s inequality to the collection $\{Y^{ij}_k\}$ for a fixed $i, j$.

For simplicity, assume that $\varepsilon < 1$. Then Bernstein’s inequality implies that

$$\Pr[|\langle u_i, u_j\rangle| > \varepsilon] \le 2\exp(-d\varepsilon^2/4).$$

Now if we set $t = \exp(d\varepsilon^2/8)$, then this bound equals $2/t^2$ by choice of $t$, and hence by taking a union
bound over at most $\binom{t}{2} < t^2/2$ pairs $(i, j)$ we conclude that there is a positive probability of
$|\langle u_i, u_j\rangle| \le \varepsilon$ holding for all $i \ne j$. □
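As an informal sanity check of Lemma 4.5 (not part of the proof), the following minimal Python sketch samples random sign vectors scaled by $1/\sqrt{d}$ and reports the largest pairwise inner product. The function names, the parameters d = 256, ε = 0.5, t = 20, and the fixed seed are assumptions chosen for illustration; t is taken far below the exp(dε²/8) vectors the lemma allows so that the experiment stays small.

```python
import itertools
import math
import random


def random_sign_vectors(t, d, rng):
    """Sample t vectors in R^d with entries +1/sqrt(d) or -1/sqrt(d), each with probability 1/2."""
    scale = 1.0 / math.sqrt(d)
    return [[scale if rng.random() < 0.5 else -scale for _ in range(d)] for _ in range(t)]


def max_abs_inner_product(vectors):
    """Largest |<u_i, u_j>| over all pairs i != j."""
    return max(
        abs(sum(a * b for a, b in zip(u, v)))
        for u, v in itertools.combinations(vectors, 2)
    )


def union_bound(t, d, eps):
    """Union bound from the proof: C(t, 2) pairs, each failing with probability at most 2*exp(-d*eps^2/4)."""
    return math.comb(t, 2) * 2.0 * math.exp(-d * eps * eps / 4.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, purely for reproducibility
    d, eps, t = 256, 0.5, 20
    vectors = random_sign_vectors(t, d, rng)
    print("bound on failure probability:", union_bound(t, d, eps))
    print("max |<u_i, u_j>|:", max_abs_inner_product(vectors))
```

Whenever the union bound is below 1, a sample in which every pair satisfies the ε bound must exist, which is the content of the lemma.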


4.2 Verifying the Width of a Point Set 

Let the width of a point set be the minimum distance between two parallel hyperplanes that enclose it. Like the MEB
problem, the width of a point set can be approximated by a streaming algorithm using $O(1/\varepsilon^{O(d)})$ space [?],
without access to a prover.

We present a similar protocol for verifying the width of a point set: an efficient constant-round SIP that computes
the width exactly. As before, we study the problem in the discrete setting, i.e., we assume that the data stream
elements are a subset of points over a grid $[m]^d$. Let $\mathcal{R}$ denote the set of all ranges defined by a single
slab (i.e., each range consists of the area between some two parallel hyperplanes).


4.2.1 Certificate of Optimality 


Given a slab $S$ that is claimed to be a minimal-width slab covering the input point set $P$, the following lemma (akin to
Lemma 4.1) guarantees the existence of a sparse witness of optimality for $S$.


Lemma 4.6. Given the input point set $P$ in $d$ dimensions, every optimal-width single slab $S$ consisting of the area
between parallel hyperplanes $h_1, h_2$ covering $P$ can be described by a set of $k + k' = d + 1$ points from the input
point set $P$, in which $k$ points lie on the hyperplane $h_1$ and $k'$ points lie on the hyperplane $h_2$.


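Proof. We express $S$ as an optimal solution to a certain linear program. We then infer the existence of the claimed
witness of optimality for $S$ via strong linear programming duality and complementary slackness.

Assume the two hyperplanes specifying $S$ are of the form $h_1 : \langle w, x\rangle = 1$ and
$h_2 : \langle w, x\rangle = \ell$, where $w \in \mathbb{R}^d$. Then the pair $(w, \ell)$ corresponds to an optimal solution
of the following linear program:

$$\begin{aligned}
\min\ & \ell \\
\text{s.t.}\ & \forall i \in \{1,\ldots,|P|\}: \ \langle w, x_i\rangle \ge 1 \\
& \forall i \in \{1,\ldots,|P|\}: \ \langle w, x_i\rangle \le \ell
\end{aligned}$$

We write the LP in the standard form:

$$\begin{aligned}
\max\ & -\ell \\
\text{s.t.}\ & \forall i \in \{1,\ldots,|P|\}: \ (-x_i^T) \cdot w \le -1 \\
& \forall i \in \{1,\ldots,|P|\}: \ x_i^T \cdot w - \ell \le 0
\end{aligned}$$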
Let $x_{ij}$ denote the $j$'th entry of input point $x_i \in \mathbb{R}^d$. Standard manipulations reveal the dual:

$$\begin{aligned}
\min\ & -\sum_{i=1}^{|P|} y_i \\
\text{s.t.}\ & \forall j \in \{1,\ldots,d\}: \ \sum_{i=1}^{|P|} (y_i - z_i)\, x_{ij} = 0 \\
& \sum_{i=1}^{|P|} z_i = 1 \\
& y, z \ge 0
\end{aligned}$$

Let $y = (y_1, \ldots, y_{|P|})$ and $z = (z_1, \ldots, z_{|P|})$ denote an optimal solution to the above dual.


Claim 1. For any $i$, $y_i$ and $z_i$ cannot both be nonzero.

Proof. By complementary slackness, $y_i$ and $z_i$ are both nonzero only if both of the corresponding primal inequalities
are tight, which can only hold if the width is zero. □


Claim 2. In total, the number of nonzero entries in $y$ and $z$ must be at least $d+1$.


Proof. Fix any $j \in \{1,\ldots,d\}$ and consider the constraint

$$\sum_{i=1}^{|P|} y_i x_{ij} = \sum_{i=1}^{|P|} z_i x_{ij} \qquad (2)$$

from the dual. Note that by Claim 1, all the $x_{ij}$'s with a nonzero coefficient $y_i$ on the left hand side of
Equation (2) are distinct from the $x_{ij}$'s with a nonzero coefficient $z_i$ on the right hand side of Equation (2).
Suppose by way of contradiction that there are at most $d$ nonzero entries in total in $y$ and $z$. Fix one such nonzero
entry, say, $z_k$. We can rewrite Equation (2) as:

$$z_k x_{kj} = \sum_{i=1}^{|P|} y_i x_{ij} - \sum_{i \ne k} z_i x_{ij}$$

and by dividing by $z_k$ and relabeling the coefficients, we get:

$$x_{kj} = \sum_{i=1}^{|P|} \alpha_i x_{ij}$$

for some coefficients $\alpha_1, \ldots, \alpha_{|P|} \in \mathbb{R}$, where at most $d-1$ of the $\alpha_i$'s are nonzero.
Since the same coefficients work for every coordinate $j$, this says that there exist $d$ points not in general
position, which is a contradiction. Therefore Claim 2 is true. □

□ 
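For intuition about Lemma 4.6 in the plane ($d = 2$, so the witness has $d + 1 = 3$ points), here is a small brute-force Python sketch. It relies on the standard fact that a minimum-width slab of a planar point set has one boundary line through two input points; the function name and the example coordinates are hypothetical and chosen only for illustration.

```python
import itertools
import math


def min_width_slab_2d(points):
    """Brute-force minimum-width slab of a planar point set (assumes at least two distinct points).

    Tries every line through two input points as boundary h1 and uses the farthest
    input point as the support of the parallel boundary h2.
    Returns (width, (p, q), far_point): p and q lie on h1, far_point lies on h2.
    """
    best = None
    for p, q in itertools.combinations(points, 2):
        nx, ny = q[1] - p[1], p[0] - q[0]  # a normal vector of the line through p and q
        norm = math.hypot(nx, ny)
        if norm == 0:
            continue  # p and q coincide
        signed = [((x - p[0]) * nx + (y - p[1]) * ny) / norm for x, y in points]
        if min(signed) < -1e-12 and max(signed) > 1e-12:
            continue  # points on both sides: this line cannot bound a covering slab
        far = max(range(len(points)), key=lambda i: abs(signed[i]))
        width = abs(signed[far])
        if best is None or width < best[0]:
            best = (width, (p, q), points[far])
    return best


if __name__ == "__main__":
    width, on_h1, on_h2 = min_width_slab_2d([(0, 0), (4, 0), (0, 3), (1, 1)])
    print(width, on_h1, on_h2)
```

On the right triangle in the usage example, the narrowest slab is bounded by the hypotenuse (two witness points) and the opposite vertex (one witness point), matching $k + k' = 3$.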


Now using Lemma 4.6 we can give the following upper bound for the size of the range set $\mathcal{R}$ in the one-slab
problem on $[m]^d$.

Lemma 4.7. Given a grid $[m]^d$, the size of the range set $\mathcal{R}$ consisting of all slabs is $O(m^{d^2+d})$.


Proof. Based on Lemma 4.6, each slab on the grid $[m]^d$ can be determined by two parallel hyperplanes including $d+1$
points. Thus we have:

$$|\mathcal{R}| = O\left(\binom{m^d}{d+1}\right) = O(m^{d^2+d}).$$

□
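The counting step can be checked numerically for small parameters. The sketch below is only illustrative: the helper names are hypothetical, and it ignores the constant number of ways the $d+1$ points can be split between the two hyperplanes.

```python
import math


def supporting_set_count(m, d):
    """Number of ways to choose d + 1 supporting points among the m^d grid points."""
    return math.comb(m ** d, d + 1)


def range_index_bits(m, d):
    """Bits needed to index a range under the coarser bound m^(d^2 + d)."""
    return math.ceil((d * d + d) * math.log2(m))


if __name__ == "__main__":
    m, d = 4, 2
    print(supporting_set_count(m, d), "<=", m ** (d * d + d))
    print("bits per range index:", range_index_bits(m, d))
```

The logarithm of this count is what drives the space and communication costs of the protocols built on the range set.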
4.2.2 The Protocol


The protocol works as follows:


1. V processes the data stream as if for a RangeCount query with respect to $\mathcal{R}$, defined as the set of single
slabs including $k + k' = d + 1$ points.

2. P returns a candidate slab $S$ consisting of two parallel hyperplanes $h_1, h_2$, claimed as the slab with minimum
width which covers all the points $P$ in the input data stream. P also sends a set $T_1$ of $k$ points and a set $T_2$ of
$k'$ points claimed to satisfy the properties of Lemma 4.6.

3. V verifies that $k + k' = d + 1$, checks that all points in $T_1$ lie on $h_1$ and all points in $T_2$ lie on $h_2$, and
runs the PointQuery protocol $d+1$ times to check that all points in $T_1 \cup T_2$ actually appeared in the input set $P$.

4. V initiates a RangeCount query for the range corresponding to the slab $S$, and verifies that the answer is $n = |P|$,
i.e., that $S$ covers all the input points.
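The verifier's local checks in steps 3 and 4 can be sketched as follows. This is a non-streaming illustration under explicit assumptions: the PointQuery and RangeCount sub-protocols are replaced by direct lookups over a stored copy of the stream (which a real streaming verifier cannot keep), the slab is given as a hypothetical triple (w, b1, b2) describing the hyperplanes ⟨w, x⟩ = b1 and ⟨w, x⟩ = b2, and coordinates are integers so comparisons are exact.

```python
def dot(w, x):
    """Inner product of two integer vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def width_verifier_accepts(stream, d, w, b1, b2, T1, T2):
    """Checks of steps 3 and 4 for the slab between <w, x> = b1 and <w, x> = b2."""
    # Step 3a: the witness must consist of exactly d + 1 points.
    if len(T1) + len(T2) != d + 1:
        return False
    # Step 3b: points of T1 lie on h1 and points of T2 lie on h2.
    if any(dot(w, x) != b1 for x in T1) or any(dot(w, x) != b2 for x in T2):
        return False
    # Step 3c: stand-in for the d + 1 PointQuery invocations.
    seen = set(stream)
    if any(x not in seen for x in list(T1) + list(T2)):
        return False
    # Step 4: stand-in for the RangeCount query on the slab.
    lo, hi = min(b1, b2), max(b1, b2)
    covered = sum(1 for x in stream if lo <= dot(w, x) <= hi)
    return covered == len(stream)


if __name__ == "__main__":
    stream = [(0, 0), (4, 0), (0, 3), (1, 1)]
    print(width_verifier_accepts(stream, 2, (3, 4), 12, 0, [(4, 0), (0, 3)], [(0, 0)]))
```

In the actual protocol each stand-in is an interactive sub-protocol with its own soundness error, which is where the union bound in the analysis comes from.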

Perfect completeness of the protocol is immediate from Lemma 4.6 and the completeness of the PointQuery and
RangeCount protocols. The soundness error of the protocol is at most $(d+2) \cdot \varepsilon_s$, where $\varepsilon_s$ is an
upper bound on the soundness errors of the PointQuery and RangeCount protocols. To see this, note that if $T_1$ and $T_2$
are as claimed, then there is no slab of width less than that of $S$ covering the input points. And the probability that
the verifier accepts when $T_1$ and $T_2$ are not as claimed is bounded by $(d+2) \cdot \varepsilon_s$, via a union bound over
all $(d+1)$ invocations of the PointQuery protocol and the single invocation of the RangeCount protocol. Theorem 4.8
follows.


In the protocol of Theorem 4.8 the prover and verifier can be made to satisfy the same runtime bounds as in the
MEB protocol of Section 4.1, assuming the distance function $D$ satisfies the same “efficient computability” condition
discussed there.

Theorem 4.8. Given a stream of $n$ input points from $[m]^d$, there is a three-message SIP for verifying the width of
the input with space and communication cost bounded by $O(d^2 \cdot \log^2 m)$.


4.3 Verifying Approximate Metric k-Centers

Using the same ideas as for the MEB, we can verify a 2-approximation to the metric k-center problem. At a high level,
the SIP verifies the correctness of the witness produced by running Gonzalez’ approximation algorithm for metric
k-center clustering, namely, $k + 1$ points that are at least distance $r$ apart (where $r$ is the claimed 2-approximate
radius). This sparse witness can be verified using the PointQuery protocol.

Here we formalize the metric k-center problem and describe the protocol in detail; the main result then follows.

A k-center clustering of a set of points $p_1, \ldots, p_n$ in a metric space $(X, d)$ is a set of $k$ centers
$C = \{c_1, \ldots, c_k\}$. The cost of such a clustering is

$$\mathrm{cost}(C) = \max_i \min_j d(p_i, c_j).$$

Definition 4.9. Let $(X, d)$ be a metric space. Let $p_1, p_2, \ldots, p_n, k$ be a stream of points from $(X, d)$ followed
by a parameter $k$. An SIP computing a 2-approximation for the metric k-center problem with completeness error
$\varepsilon_c$ and soundness error $\varepsilon_s$ has the following form. The prover begins the SIP by claiming that there
exists a k-center clustering of cost $r^*$.

• If this claim is true, the verifier must accept with probability at least $1 - \varepsilon_c$.

• If there is no k-center clustering of cost at most $r^*/2$, the verifier must reject with probability at least
$1 - \varepsilon_s$.

It is easy to provide a protocol that works deterministically if the verifier is not required to process the input in a
streaming manner. This is the standard 2-approximation algorithm of Gonzalez: the prover provides

Proof of Feasibility. A set of centers $c_1, \ldots, c_k$ satisfying $\max_i \min_j d(p_i, c_j) \le r^*$, and

Proof of Approximate Optimality. A set of $k + 1$ points $u_1, \ldots, u_{k+1}$ from the stream with the promise that

$$\min_{i \ne j} d(u_i, u_j) \ge r^*.$$

This guarantees a 2-approximation by the standard argument relying on the triangle inequality [17]. The verifier can
easily check that the relevant conditions hold.
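A prover could produce both proofs with the farthest-first traversal usually attributed to Gonzalez. The following Python sketch is one possible rendering of that traversal, not code from the paper; the function names are hypothetical, `dist` is any metric supplied by the caller, and the input is assumed to contain at least $k + 1$ distinct points.

```python
def kcenter_cost(points, centers, dist):
    """max_i min_j d(p_i, c_j): the cost of a k-center clustering."""
    return max(min(dist(p, c) for c in centers) for p in points)


def gonzalez_witness(points, k, dist):
    """Farthest-first traversal for k + 1 steps.

    Returns (centers, witness, r_star): the first k chosen points serve as centers,
    all k + 1 chosen points serve as the optimality witness, and r_star is the cost
    of the centers (equal to the distance of the last chosen point to the centers).
    """
    chosen = [points[0]]
    nearest = [dist(p, points[0]) for p in points]  # distance to the closest chosen point
    while len(chosen) < k + 1:
        i = max(range(len(points)), key=lambda idx: nearest[idx])
        chosen.append(points[i])
        nearest = [min(old, dist(p, points[i])) for old, p in zip(nearest, points)]
    centers = chosen[:k]
    return centers, chosen, kcenter_cost(points, centers, dist)


if __name__ == "__main__":
    pts = [0, 1, 2, 10, 11, 20]
    print(gonzalez_witness(pts, 2, lambda a, b: abs(a - b)))
```

Because the distances picked by the traversal never increase, the $k + 1$ chosen points are pairwise at least `r_star` apart, which is exactly the promise required of the optimality witness.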


The SIP. Let $B_r(x) = \{y \in X \mid d(x, y) \le r\}$ denote a ball of radius $r$ with center $x$ in the metric space. We
define a range space $\mathcal{R}$ consisting of all unions of $k$ balls of radius $r$ for all values of $r$:

$$\mathcal{R} = \left\{\bigcup_{z \in Z} B_r(z) \;\middle|\; Z \subset X,\ |Z| = k,\ \exists x, y \in X,\ d(x, y) = r\right\}$$

Note that $|\mathcal{R}| \le m^{k+2}$, where $m$ is the size of the metric space, i.e. $|X| = m$.

The protocol works as follows: 


1. V processes the data stream as if for a RangeCount query using range space $\mathcal{R}$, as well as for $k+1$ parallel
PointQuery queries.

2. P returns a candidate clustering $c_1, c_2, \ldots, c_k$ with the claimed cost $r^*$, as well as $k + 1$ points
$u_1, \ldots, u_{k+1}$ from the stream witnessing (approximate) optimality.

3. V initiates a RangeCount query for the range $\bigcup_i B_{r^*}(c_i)$ and verifies that the answer is $n = |P|$.

4. V verifies that the distance between all distinct pairs of points $(u_i, u_j)$ is at least $r^*$, and invokes $(k+1)$
PointQuery queries to ensure that each $u_i$ appeared in the input stream.
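As a non-streaming illustration of steps 3 and 4, the sketch below performs the verifier's checks with RangeCount and PointQuery replaced by direct lookups over a stored copy of the stream. That substitution, the function name, and the example values are assumptions made for illustration only.

```python
def kcenter_verifier_accepts(stream, k, centers, r_star, witnesses, dist):
    """Verifier checks for a claimed clustering (centers, r_star) and k + 1 witness points."""
    if len(centers) != k or len(witnesses) != k + 1:
        return False
    # Step 3: stand-in for RangeCount on the union of k balls of radius r_star.
    covered = sum(1 for p in stream if any(dist(p, c) <= r_star for c in centers))
    if covered != len(stream):
        return False
    # Step 4a: every pair of witness points must be at least r_star apart.
    for i in range(len(witnesses)):
        for j in range(i + 1, len(witnesses)):
            if dist(witnesses[i], witnesses[j]) < r_star:
                return False
    # Step 4b: stand-in for the k + 1 PointQuery invocations.
    seen = set(stream)
    return all(u in seen for u in witnesses)


if __name__ == "__main__":
    stream = [0, 1, 2, 10, 11, 20]
    metric = lambda a, b: abs(a - b)
    print(kcenter_verifier_accepts(stream, 2, [0, 20], 10, [0, 20, 10], metric))
```

A claimed radius that is too large fails the pairwise-distance check, and one that is too small fails the coverage check.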


The correctness of the protocol follows from the correctness of Gonzalez’s algorithm and Theorem 2.2. Note that
approximating metric k-center to within a factor of $2 - \varepsilon$ is NP-hard [14]. The above protocol is a streaming
variant of an MA protocol. Under the widely-believed assumption that MA = NP, there is no $2 - \varepsilon$ approximation
for metric k-center with a polynomial-time verifier, regardless of whether the verifier processes the input in a
streaming manner.

As with our protocols for the MEB problem and computing the width of a point set, V and P can be made to run in
quasilinear time if the metric $d$ satisfies mild efficient-computability properties.


Theorem 4.10. Let $(X, d)$ be a metric space in which $|X| = m$. Given an input point set $P$ with $|P| = n$ from $(X, d)$,
there is a streaming interactive protocol for verifying k-center clustering on $P$ with space and communication costs
bounded by $O(k + \log(|\mathcal{R}|) \cdot \log(n \cdot [\ldots]))$, in which $|\mathcal{R}| \le m^{k+2}$.


5 SIPs for General Clustering Problems 

In this section, we give SIPs for two very general clustering problems: the k-center problem, and the k-slab problem.
In the k-center problem, given a set of $n$ points in $[m]^d$, the goal is to find $k$ centers so as to minimize the maximum
point-center distance. In the k-slab problem, the goal is instead to find $k$ hyperplanes so as to minimize the maximum
point-hyperplane distance.

5.1 k-Slabs

We first consider the k-slab problem. Even when $k = 2$ (and $d = 3$), this problem appears to be difficult to solve
efficiently without access to a prover: in fact, it was shown that this problem does not admit a coreset for arbitrary
inputs [?]. Later, Edwards et al. [13] showed that if the input points are from $[m]^d$ (as in our case), then there
exists a coreset, of size exponential in the dimension $d$, which provides a $(1 + \varepsilon)$-approximation to the
k-slab problem. However, to the best of our knowledge, the k-slab problem does not admit a streaming algorithm. As
before, we can think of a “cluster” as described not by a single hyperplane, but as the region between two parallel
hyperplanes that contain all the points in that cluster. The width of the cluster is the distance between the two
hyperplanes. We now think of the k-slab objective as minimizing the maximum width of a cluster, a quantity we call the
width of the k-slab.
Defining the Relevant Range Space. Each slab can be described by $d + 1$ points (that define the hyperplane) in
$[m]^d$ and a width parameter. A k-slab is a collection of $k$ such slabs. Let $\mathcal{R}$ be the range space consisting
of the set of all k-slabs. This range space has size $|\mathcal{R}| = m^{O(kd^2)}$. For any k-slab $\sigma \in \mathcal{R}$,
let $w(\sigma)$ denote its width. We will assume a canonical ordering of the ranges $\sigma_1, \sigma_2, \ldots$ in increasing
order of width (with an arbitrary ordering among ranges having the same width), as well as an effective enumeration
procedure that given an index $i$ returns the range $\sigma_i$ in the canonical order. We will also assume the existence of
a mapping function from $\mathbb{R}$ to $\{-1, \ldots, |\mathcal{R}| - 1\}$ which maps a width value $w$ to the smallest index
$i$ such that $w(\sigma_i) = w$, and to the null value $-1$ otherwise. Notice that the verifier can compute this mapping
function by explicit enumeration, using only enough space to store one range.

Stream Observation Phase of the SIP. Let $\tau = (p_1, p_2, \ldots, p_n)$ be the stream of input points. As the verifier
sees the data points, it generates a derived stream $\tau'$ as follows. For each point $p_i$ in the actual input stream
$\tau$, V inserts into $\tau'$ all k-slabs $\sigma \in \mathcal{R}$ which contain the point $p_i$. Notice that $\tau'$ is a
deterministic function of $\tau$, and hence the prover P, who sees $\tau$, can also materialize $\tau'$, with no communication
from V to P required to specify $\tau'$.* Note that the frequency $f_\sigma$ of the range $\sigma$ in this derived stream
$\tau'$ is the number of points that $\sigma$ contains.
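The derived stream is easiest to picture on a toy range space. The sketch below replaces k-slabs by one-dimensional intervals over $\{0, \ldots, m-1\}$, an assumption made purely for illustration, and materializes the derived stream explicitly; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def all_intervals(m):
    """Toy range space: all intervals [a, b] with 0 <= a <= b < m."""
    return list(combinations_with_replacement(range(m), 2))


def derived_stream(points, ranges):
    """For each input point, emit every range that contains it."""
    for p in points:
        for a, b in ranges:
            if a <= p <= b:
                yield (a, b)


def range_frequencies(points, ranges):
    """Frequency of each range in the derived stream = number of input points it contains."""
    return Counter(derived_stream(points, ranges))


if __name__ == "__main__":
    freq = range_frequencies([2, 5, 3], all_intervals(8))
    print(freq[(2, 5)], freq[(3, 5)], freq[(6, 7)])
```

A range is feasible exactly when its frequency equals the number of input points, which is the quantity the RangeCount query reports.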


Proving Feasibility. After the stream $\tau$ has passed, P supplies a candidate k-slab $\sigma^*$ and claims that this has
optimal width $w^* = w(\sigma^*)$. By applying the RangeCount protocol from Theorem 2.2 to the derived stream $\tau'$, V can
check that $f_{\sigma^*} = n$ and that $\sigma^*$ is therefore feasible. This feasibility check requires only 3 messages.


Optimality. Proving optimality is more involved, and for that we use the GKR protocol as follows.

The verifier must check if the optimal width is $w^*$ as claimed by the prover. Given a subset $S \subseteq \mathcal{R}$
of k-slabs, let $I_S : \{0,1\}^{\log|\mathcal{R}|} \to \{0,1\}$ denote the indicator function that evaluates to 1 on the
binary representation of a range $\sigma$ of a k-slab if $\sigma \in S$, and evaluates to 0 otherwise. Let
$S := \{\sigma : w(\sigma) < w^*\}$, and let $T = \{\sigma : f_\sigma \ne n\}$. Let
$F = \sum_{\sigma \in \mathcal{R}} I_S(\sigma)\, I_T(\sigma)$. Then the prover has supplied an optimal range $\sigma^*$ if and
only if $F = |S|$. Note that effectively we are summing $I_T(\sigma)$ over a prefix of the sorted list of ranges, namely
those in $S$.
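The optimality condition can be checked directly on a toy instance. In the sketch below the range space is again the set of one-dimensional intervals over $\{0, \ldots, m-1\}$ with width $b - a$, an illustrative assumption standing in for k-slabs; the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def optimality_sum(points, m, w_star):
    """Return (F, |S|) where S = {ranges of width < w_star} and F counts ranges in S that miss a point."""
    n = len(points)
    ranges = list(combinations_with_replacement(range(m), 2))  # intervals [a, b], a <= b
    freq = {(a, b): sum(1 for p in points if a <= p <= b) for a, b in ranges}
    S = [(a, b) for a, b in ranges if b - a < w_star]
    F = sum(1 for r in S if freq[r] != n)  # sum of I_S * I_T over all ranges
    return F, len(S)


if __name__ == "__main__":
    points = [2, 5, 3]
    print(optimality_sum(points, 8, 3))  # claimed width 3 (the interval [2, 5])
    print(optimality_sum(points, 8, 4))  # claimed width 4 (for example [2, 6])
```

For the true optimum every strictly narrower range misses some point, so $F = |S|$; for an overestimate, the narrower feasible interval $[2, 5]$ lies in $S$ but not in $T$, so $F < |S|$.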

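Let $\mathbb{F}$ be a field of prime order polynomial in $n$, with $|\mathbb{F}| > n$ and $\log|\mathbb{F}| = O(\log n)$. Let
$\tilde{I}_S : \mathbb{F}^{\log|\mathcal{R}|} \to \mathbb{F}$ be the multilinear extension of $I_S$, and let $\tilde{I}_T$ be the
multilinear extension of $I_T$. That is, $\tilde{I}_S$ is the unique multilinear polynomial over $\mathbb{F}$ satisfying
$\tilde{I}_S(\sigma) = I_S(\sigma)$ for all $\sigma \in \{0,1\}^{\log|\mathcal{R}|}$, and similarly for $\tilde{I}_T$. It is
standard that

$$\tilde{I}_S(x) = \sum_{\sigma \in \{0,1\}^{\log|\mathcal{R}|}} I_S(\sigma) \cdot \chi_\sigma(x), \quad \text{where} \qquad (3)$$

$$\chi_\sigma(x_1, \ldots, x_{\log|\mathcal{R}|}) := \prod_{i=1}^{\log|\mathcal{R}|} \left(x_i \sigma_i + (1 - x_i)(1 - \sigma_i)\right), \qquad (4)$$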

and similarly for $\tilde{I}_T$. To compute $F$, it suffices to apply the sum-check protocol to the polynomial
$g := \tilde{I}_S \cdot \tilde{I}_T$. The protocol requires $\log|\mathcal{R}|$ rounds, and the total communication cost is
$O(\log|\mathcal{R}|)$ field elements. To perform the necessary check in the final round of this protocol, V needs to
evaluate $g$ at a random point $r \in \mathbb{F}^{\log|\mathcal{R}|}$. By definition of $g$, it suffices for V to evaluate
$\tilde{I}_T(r)$ and $\tilde{I}_S(r)$. Since the set $S$ does not depend on the stream ($S$ depends only on the claimed optimal
width $w^*$), V can evaluate $\tilde{I}_S(r)$ after the stream has passed, using
$O(\log(|\mathcal{R}|) \cdot \log|\mathbb{F}|)$ bits of space, using standard techniques (see for example [12, Section 2]).
However, it is not possible for V to evaluate $\tilde{I}_T(r)$ in a streaming manner. Instead, V asks P to tell her
$\tilde{I}_T(r)$, and checks the claimed value by invoking the streaming implementation of the GKR protocol (cf. Lemma
2.4). More precisely, similar to [10, Section 3.3], we observe that Fermat’s Little Theorem implies that
$f_\sigma \ne n$ if and only if $(f_\sigma - n)^{|\mathbb{F}| - 1} = 1 \bmod |\mathbb{F}|$. This implies via Equation (3) that

$$\tilde{I}_T(r) = \sum_{\sigma \in \{0,1\}^{\log|\mathcal{R}|}} (f_\sigma - n)^{|\mathbb{F}| - 1} \cdot \chi_\sigma(r),$$

where $\chi_\sigma$ was defined in Equation (4). As in [10, Section 3.3], it is possible to compute the right hand side of
this equality by a log-space uniform arithmetic circuit of size $O(|\mathcal{R}|)$ and depth
$O(\log|\mathbb{F}|) = O(\log n)$ over $\mathbb{F}$. By applying the GKR protocol to this circuit, V forces P to faithfully
provide $\tilde{I}_T(r)$. This completes the protocol. Completeness and soundness follow from completeness and soundness
of the sum-check protocol and of the GKR protocol. It is straightforward to check that the protocol has the claimed
space and communication costs.

* The running time cost increase for the mapping function and the derived stream can be avoided by observing that the
frequency vector $f_\sigma$ is not arbitrary, since it tracks membership in ranges. This trick is described in [7] and
allows us to modify the extension polynomial used to report entries of the vector without needing to write down the
explicit derived stream. Also see the discussion in Section [?].
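To make the round structure concrete, here is a minimal Python sketch of the sum-check protocol for a product of two multilinear extensions, with an honest prover and with the indicator of $T$ built through the Fermat's Little Theorem identity. Several things are assumptions for illustration: the modulus $2^{61} - 1$, all function names, and above all the final check, where the verifier in this sketch evaluates both extensions from the full tables, whereas in the protocol described above it obtains the second value through the GKR protocol.

```python
import random

P = 2**61 - 1  # a prime modulus, assumed here for illustration only


def indicator_T_table(freqs, n):
    """(f - n)^(P-1) mod P, which is 1 if f != n and 0 if f == n (requires n < P)."""
    return [pow((f - n) % P, P - 1, P) for f in freqs]


def fold(table, r):
    """Fix the first variable of a multilinear table (length a power of two) to r."""
    half = len(table) // 2
    return [((1 - r) * table[i] + r * table[half + i]) % P for i in range(half)]


def mle_eval(table, point):
    """Evaluate the multilinear extension of a table at a point of F^v."""
    for r in point:
        table = fold(table, r)
    return table[0]


def round_message(A, B):
    """Honest prover: values at X = 0, 1, 2 of the degree-2 round polynomial."""
    return [sum(a * b for a, b in zip(fold(A, X), fold(B, X))) % P for X in (0, 1, 2)]


def interpolate3(evals, r):
    """Evaluate at r the quadratic passing through (0, e0), (1, e1), (2, e2)."""
    inv2 = pow(2, P - 2, P)
    l0 = (r - 1) * (r - 2) % P * inv2 % P
    l1 = (-r * (r - 2)) % P
    l2 = r * (r - 1) % P * inv2 % P
    return (evals[0] * l0 + evals[1] * l1 + evals[2] * l2) % P


def sum_check(table_S, table_T, claimed_sum, rng):
    """Return True iff the verifier accepts the claim sum_sigma I_S(sigma) * I_T(sigma) = claimed_sum."""
    A, B = list(table_S), list(table_T)
    claim, point = claimed_sum % P, []
    while len(A) > 1:
        evals = round_message(A, B)
        if (evals[0] + evals[1]) % P != claim:
            return False  # round polynomial inconsistent with the running claim
        r = rng.randrange(P)
        claim = interpolate3(evals, r)
        A, B = fold(A, r), fold(B, r)
        point.append(r)
    # Final check at the random point (done here from the tables; see the lead-in).
    return claim == mle_eval(table_S, point) * mle_eval(table_T, point) % P
```

With eight toy ranges sorted by width, frequencies `[3, 1, 0, 3, 2, 3, 1, 0]`, `n = 3`, and `S` equal to the first five ranges, the true sum is 3 while `|S| = 5`, so a claim of optimality would be rejected.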

Protocol Costs. The total communication cost of the protocol is
$O(\log n \cdot \log(|\mathcal{R}|) \cdot \log|\mathbb{F}|) = O(k \cdot d^2 \cdot \log m \cdot \log^2 n)$ bits. The total space
cost is $O(\log(|\mathcal{R}|) \cdot \log(|\mathbb{F}|)) = O(k \cdot d^2 \cdot \log m \cdot \log n)$ bits. The total number of
rounds required is $O(\log n \cdot \log(|\mathcal{R}|)) = O(k \cdot d^2 \cdot \log m \cdot \log n)$.


Theorem 5.1. Given a stream of $n$ points, there is a streaming interactive proof for computing the optimal k-slab, with
space and communication bounded by $O(k \cdot d^2 \cdot \log m \cdot \log^2 n)$. The total number of rounds is
$O(k \cdot d^2 \cdot \log m \cdot \log n)$.


We note that it is possible to both avoid using the GKR protocol and reduce the number of rounds in Theorem 5.1
by a factor of $\log(n)$, using a technique introduced by Gur and Raz [18], and applied by Klauck and Prakash [27] to
obtain an $O(\log|\mathcal{R}|)$-round SIP for computing the number of distinct items in a data stream. However, these
techniques sacrifice perfect completeness, and increase the communication complexity of the protocol by
polylogarithmic factors. We omit the details of this technique for brevity.


5.2 k-Center

We can use the same idea as above to verify solutions for Euclidean k-center. The relevant range space here consists of
unions of $k$ balls of radius $r$, for all choices of centers and radii in the grid. The size of this range space is
$m^{O(kd)}$. We omit further details and merely state the main result.

Theorem 5.2. Given a stream of $n$ input points, there is an SIP for computing the optimal k-center with space and
communication bounded by $O(k \cdot d \cdot \log m \cdot \log^2 n)$. The total number of rounds is
$O(k \cdot d \cdot \log m \cdot \log n)$.